Hindsight policy gradients
Authors
Abstract
Goal-conditional policies allow reinforcement learning agents to pursue specific goals during different episodes. In addition to their potential to generalize desired behavior to unseen goals, such policies may also help in defining options for arbitrary subgoals, enabling higher-level planning. While trying to achieve a specific goal, an agent may also be able to exploit information about the degree to which it has achieved alternative goals. Reinforcement learning agents have only recently been endowed with such capacity for hindsight, which is highly valuable in environments with sparse rewards. In this paper, we show how hindsight can be introduced to likelihood-ratio policy gradient methods, generalizing this capacity to an entire class of highly successful algorithms. Our preliminary experiments suggest that hindsight may increase the sample efficiency of policy gradient methods.
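To make the idea of reusing experience for alternative goals concrete, here is a minimal sketch, not the estimator from the paper. It assumes a one-step episode, a tabular softmax policy whose logits are stored per goal in a hypothetical table `theta`, a sparse reward of 1 when the chosen action equals the goal, and ordinary importance weighting to correct for the fact that the action was sampled while pursuing a different goal. All function names are illustrative.

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of logits.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def policy(theta, goal):
    # pi(. | goal): theta[goal] holds the logits for that goal (assumed layout).
    return softmax(theta[goal])

def reward(action, goal):
    # Assumed sparse reward: 1 only when the action matches the goal.
    return 1.0 if action == goal else 0.0

def grad_log_policy(theta, action, goal):
    # Gradient of log pi(action | goal) with respect to the logits theta[goal].
    probs = policy(theta, goal)
    return [(1.0 if a == action else 0.0) - p for a, p in enumerate(probs)]

def hindsight_gradient(theta, action, goal_orig, goal_alt):
    # Likelihood-ratio gradient estimate for goal_alt, computed from an action
    # that was sampled while pursuing goal_orig. The importance weight
    # pi(a | goal_alt) / pi(a | goal_orig) corrects for the mismatch.
    weight = policy(theta, goal_alt)[action] / policy(theta, goal_orig)[action]
    scale = weight * reward(action, goal_alt)
    return [scale * g for g in grad_log_policy(theta, action, goal_alt)]

def exact_gradient(theta, goal):
    # Exact gradient of the expected reward under pi(. | goal), by enumeration.
    probs = policy(theta, goal)
    n = len(probs)
    total = [0.0] * n
    for a in range(n):
        g = grad_log_policy(theta, a, goal)
        for i in range(n):
            total[i] += probs[a] * reward(a, goal) * g[i]
    return total

def expected_hindsight_gradient(theta, goal_orig, goal_alt):
    # Expectation of the hindsight estimate over actions drawn from pi(. | goal_orig).
    behavior = policy(theta, goal_orig)
    n = len(behavior)
    total = [0.0] * n
    for a in range(n):
        h = hindsight_gradient(theta, a, goal_orig, goal_alt)
        for i in range(n):
            total[i] += behavior[a] * h[i]
    return total
```

Under these assumptions, averaging `hindsight_gradient` over actions drawn while pursuing `goal_orig` matches `exact_gradient` for `goal_alt`, so an episode collected for one goal still carries gradient information about another. Multi-step episodes would need per-step weights, which this sketch leaves out.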
Similar resources
Evaluating political decision makers: With the benefit of hindsight bias?
In this paper we investigate the effects of biased decision evaluation in a simple two-period political agency model. We assume that voters are subjected to hindsight bias in their judgment about a politician’s ability to take appropriate decisions. High ability is defined as an informational advantage over voters as to the welfare maximizing policy, creating incentives for low-ability politici...
Hindsight biased policy evaluation
In this paper we present a political-agency model in which voters exhibit a cognitive deficiency known as hindsight bias: after the uncertainty about an event is resolved, they consider the realized outcome more foreseeable than it actually was. For their reelection decision, voters evaluate the politician’s ability based on the history of observed actions and outcomes. High ability is defined ...
Markov Games: Receding Horizon Approach
We consider a receding horizon approach as an approximate solution to two-person zero-sum Markov games with infinite horizon discounted cost and average cost criteria. We first present error bounds from the optimal equilibrium value of the game when both players take correlated equilibrium receding horizon policies that are based on exact or approximate solutions of receding finite horizon subg...
POND-Hindsight: Applying Hindsight Optimization to POMDPs
We present the POND-Hindsight entry in the POMDP track of the 2011 IPPC. Similar to successful past entrants (such as FF-Replan and FF-Hindsight) in the MDP tracks of the IPPC, we sample action observations (similar to how FF-Replan samples action outcomes) and guide the construction of policy trajectories with a conformant (as opposed to classical) planning heuristic. We employ a number of tech...
Overcoming Exploration in Reinforcement Learning with Demonstrations
Exploration in environments with sparse rewards has been a persistent problem in reinforcement learning (RL). Many tasks are natural to specify with a sparse reward, and manually shaping a reward function can result in suboptimal performance. However, finding a non-zero reward is exponentially more difficult with increasing task horizon or action dimensionality. This puts many real-world tasks ...
Journal title:
- CoRR
Volume abs/1711.06006
Publication date 2017